Abstract. Statistical methods have been widely employed in many practical natural 
language processing applications. More specifically, complex networks concepts and 
methods from dynamical systems theory have been successfully applied to recognize 
stylistic patterns in written texts. Despite the large number of studies devoted to 
representing texts with physical models, only a few studies have assessed the relevance 
of attributes derived from the analysis of stylistic fluctuations. Because fluctuations 
represent a pivotal factor for characterizing a myriad of real systems, this study 
focused on the analysis of the properties of stylistic fluctuations in texts via topological 
analysis of complex networks and intermittency measurements. The results showed 
that different authors display distinct fluctuation patterns. In particular, it was found 
that it is possible to identify the authorship of books using the intermittency of 
specific words. Taken together, the results described here suggest that the patterns 
found in stylistic fluctuations could be used to analyze other related complex systems. 
Furthermore, the discovery of novel patterns related to textual stylistic fluctuations 
indicates that these patterns could be useful to improve the state of the art of many 
stylistic-based natural language processing tasks. 


1. Introduction 

The application of concepts from Physics in textual analysis has become increasingly 
widespread. The use of entropy concepts is perhaps one of the best-known 
examples of adapting methods from Physics to language-based models [8]. In recent 
years, physicists have proposed novel approaches to tackle several natural language 
processing problems [9–19]. The emergence of fundamental principles of organization 
common to all languages has been studied in terms of the least-effort principle [20]. 
Other studies have been devoted to the analysis of word frequency distributions [21–25], 
which has led to the design of novel keyword detection methods [16, 28–32]. 
Syntactical features have been employed to investigate the fundamental properties of 
the language from a physical standpoint [T6l[5ni|5l]. In the semantic/pragmatic level, 
concepts from Physics have also been used to investigate the ubiquity of ambiguous 
structures in texts [iniEniET]. 

In the field of stylometry, the use of complex networks (CN) in textual models 
has become commonplace [T^ IT^ IT6| 1331135] . More specifically, several studies have 
modeled texts as co-occurrence (word adjacency) networks, where nodes and edges 
are represented by words and adjacency relationships, respectively. It has been shown 
that networks modeling texts share the same statistical properties of many other real 
systems [36]. Specifically, such networks display both small-world and scale-free properties, 
as a consequence of Zipf’s law. Practical studies involving co-occurrence networks 
have devised algorithms to generate summaries [37], to assess text coherence and 
cohesion [38], and to evaluate the quality of manual and machine translations [3^ . 
Even though word adjacency networks mostly grasp the syntactical factors of the 
language [16], it has been shown that they also convey semantic information [T0ll26| l^. 

While co-occurrence networks focus mainly on short scales, other physical models 
have been devised to capture long-range correlations. One of the most popular 
methods borrowed from the study of dynamical systems is the burstiness of word 
occurrences [29], which represents an attribute capable of capturing long-range textual 
features. Particularly, it has been shown that core words are unevenly distributed, 
while function words display distributions generated from random processes [31]. Such 
findings have motivated the proposition of algorithms aiming to detect keywords in 
single texts [30] using level statistics [33] and information theory [29]. The long-range 
textual structure has also been studied at the character unigram level [^ITT]. 

Most of the research on textual pattern recognition has focused on the search for 
recurring patterns in order to assign a specific class to unknown instances [12]. This 
approach has certainly worked well as many enlightening findings have been made this 
way. Despite the great number of studies on textual pattern recognition, the analysis of 
stylistic fluctuations along texts has received comparatively little attention. Empirical 
studies of some real systems have shown that fluctuations play a pivotal role in the 
unambiguous characterization of complex systems [1311T7] . For example, when topology 
is a relevant network feature, the most informative patterns might be hidden in outlier 
fluctuations [IH]. If one considers the distribution of word frequency, the fluctuations 
around the average might be useful to detect the most relevant concepts [29]. In 
biological systems, dynamical fluctuations of vital signals provide valuable information 
about the current state of the system BHI. 

Given the importance of fluctuations in other real systems, the current paper 
presents a study on the properties of the stylistic variability along texts. Authors’ styles 
were characterized by measuring the topological connectivity of networks modeling 
texts [33]. The stylistic evolution was quantified by splitting the texts into shorter 
subtexts, which in turn were represented as smaller networks. The topological 
analysis of these varying networks led to several interesting findings. First and 
foremost, it was possible to identify the correct authorship of texts from a multivariate 
analysis of the stylistic fluctuations along literary works. Interestingly, in this model, 
the variability of the average shortest path lengths along subtexts turned out to be the 
most relevant feature for discriminating distinct authors. To identify the authorship 
of books, the proposed model also took advantage of the intermittency of time series 
representing the spatial distribution of words. As with the CN-based model, a 
significant accuracy rate in discriminating authorship was found. Surprisingly, when 
the intermittency of 100 function words was employed as features of the classifiers, 
the correct authorship could be found in 65% of the cases. As I shall show, the discovery 
of novel patterns related to the stylistic fluctuations in texts indicates that the proposed 
methodology can be extended to analyze other complex systems. 

This paper is organized as follows. In Section 2, the methods employed to represent 
texts as networks are presented. This section also briefly presents the main topological 
measurements employed for the characterization of complex networks. In the same 
section, the intermittency concept is presented. In Section 3, the authorship recognition 
task is studied. In this case, the variability of complex network measurements along 
texts and the intermittency of specific function words were employed as attributes of the 
classifiers for the authorship recognition task. Finally, Section 4 presents perspectives 
for further research. 

2. Methods 

In this paper, the style of written texts was quantified by measuring the topological 
properties of complex networks [60]. The representation of a text as a co-occurrence 
(word adjacency) network is detailed in Section 2.1. The topological features of complex 
networks employed to analyze the stylistic variation of texts are presented in Section 
2.2. An alternative model based on the spatial distribution of words is presented in 
Section 2.3. 

2.1. Modeling texts as complex networks 

There are several ways to model texts as networks. While semantic networks capture 
the relationships between word meanings, co-occurrence networks are more suitable to 
grasp stylistic attributes of written texts. As a matter of fact, co-occurrence networks 
represent a simplified version of syntactic networks [16] because most of the syntactic 
connections occur between neighboring words [16, 56]. 

Prior to the creation of a co-occurrence network, some pre-processing steps are 
usually performed. Firstly, words conveying low semantic content (stopwords) are 
removed. Most of the words considered as stopwords are articles and prepositions 
(see the Supplementary Information). They are removed from the analysis because 
such words are mostly employed to connect other content words. After removing the 
stopwords, words with distinct spelling referring to the same concept are mapped to 
the same form. As a consequence, nouns and verbs are mapped to their singular and 
infinitive forms, respectively [53]. To perform such mapping, it is imperative to solve 
ambiguities at the word level because the mapped form might depend upon the sense 
assumed for a given word in a given context. To assist the disambiguation algorithm, 
the words are labeled with their respective parts of speech [53]. The labeling method 
employed is based on the maximum-entropy model proposed in [54]. 

Table 1. Example of pre-processing steps performed to create a co-occurrence 
network. Firstly, stopwords are removed (see step 1). Then, the remaining words are 
converted to their canonical forms (see step 2). As a consequence, nouns and verbs 
are mapped to their singular and infinitive forms, respectively. 

Original text                 Step 1             Step 2
----------------------------  -----------------  -----------------
In the middle of the road     middle road        middle road
there was a stone there was   stone              stone
a stone in the middle of      stone middle       stone middle
the road there was a stone    road stone         road stone
in the middle of the road     middle road        middle road
there was a stone. Never      stone never        stone never
should I forget this event    I forget event     I forget event
in the life of my fatigued    life fatigued      life fatigue
retinas. Never should I       retinas never I    retina never I
forget that in the middle     forget middle      forget middle
of the road there was a       road               road
stone there was a stone       stone stone        stone stone
in the middle of the road     middle road        middle road
in the middle of the road     middle road        middle road
there was a stone.            stone              stone

After the pre-processing step, each distinct word becomes a node. Therefore, the 
total number of nodes in the network is equal to the vocabulary size ($M$) of the 
pre-processed text. The words that appear separated by up to $d - 1$ intermediate words 
are connected in the network. In this paper, the value $d = 1$ was used. Therefore, only 
adjacent words were connected. Table 1 illustrates the pre-processing steps taken to 
form a small network from the poem “In the middle of the road”, by Carlos Drummond 
de Andrade. The network obtained from the pre-processed form is shown in Figure 1. 
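
The construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stopword set is a small hypothetical list (the list actually used is given in the Supplementary Information), lemmatization and part-of-speech tagging are omitted, and edges are treated as undirected. The names `STOPWORDS`, `preprocess` and `cooccurrence_network` are introduced here for the example only.

```python
from collections import defaultdict

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"in", "the", "of", "there", "was", "a", "should", "this", "my", "that"}


def preprocess(text):
    """Lowercase, strip punctuation and drop stopwords (lemmatization is omitted)."""
    words = (w.strip(".,;:!?\"'").lower() for w in text.split())
    return [w for w in words if w and w not in STOPWORDS]


def cooccurrence_network(tokens, d=1):
    """Link distinct words separated by at most d - 1 intermediate words.

    The network is returned as an undirected adjacency map (an assumption).
    """
    adjacency = defaultdict(set)
    for i, word in enumerate(tokens):
        adjacency[word]  # touching the key makes isolated words appear as nodes
        for other in tokens[i + 1 : i + 1 + d]:
            if other != word:
                adjacency[word].add(other)
                adjacency[other].add(word)
    return dict(adjacency)


# First line of the poem used in Table 1.
tokens = preprocess("In the middle of the road there was a stone")
network = cooccurrence_network(tokens, d=1)
```

With $d = 1$ only adjacent tokens of the pre-processed text are linked, so the number of keys in `network` equals the vocabulary size $M$ of the pre-processed fragment.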

2.2. Topological characterization of complex networks 

There is a myriad of measurements currently employed to characterize the topology 
of complex networks [55]. Traditional measurements can be classified according to the 
amount of information needed for their computation. While local measurements only 
require information about the neighbors of a given node, global measurements require 
that the global network connectivity is known beforehand. There is also a third class: 
the quasi-local measurements. As the name suggests, quasi-local measurements require 
information about further neighbors (i.e. the nodes located two or more hops away 
from the node under analysis). The following list briefly describes the measurements 
employed to analyze the topology of networks modeling texts. 
Figure 1. Example of co-occurrence network created for the poem “In the middle 
of the road”, by Carlos Drummond de Andrade (see Table 1). Note that, after the 
removal of stopwords, adjacent words are connected (see first column of Table 1). 


• Clustering coefficient: the clustering coefficient ($C$), a quasi-local measurement, 
quantifies the density of links between the neighbors of a given node. If $c_i$ represents 
the number of edges between the neighbors of the node $v_i$ and $k_i$ is the total number 
of neighbors of $v_i$, the clustering coefficient is given by $C_i = 2c_i(k_i^2 - k_i)^{-1}$. In 
co-occurrence networks, the clustering coefficient measures the number of contexts in 
which a given word appears [33]. While generic words tend to take low values of 
clustering coefficient, context-specific words usually take higher values [33]. 

• Average shortest path length: to define this global measurement, let $D_{ij}$ be 
the shortest distance between nodes $v_i$ and $v_j$. The average shortest path 
length of $v_i$ is then given by 

$$l_i = \frac{1}{M-1} \sum_{j \neq i} D_{ij}. \qquad (1)$$

In textual networks, this measurement quantifies the relevance of words. More 
specifically, a given word is considered relevant either when it is highly frequent or 
when it occurs close to the most relevant words [33]. 

• Betweenness: the betweenness ($B$) is a global measurement that quantifies the 
relevance of words [55]. To do so, the betweenness counts the number of shortest 
paths passing through a specific node. In textual networks, the betweenness also 
quantifies the number of contexts in which a word appears [33]. However, unlike 
the clustering coefficient, this measurement uses the global network information to 
infer the specificity of a word. 

• Accessibility: this measurement is an extension of the degree $k$ [37]. To define the 
accessibility, consider that $p_{ij}^{(h)}$ is the probability of a random walker to go from 
node $v_i$ to node $v_j$ in $h$ steps. Mathematically, the accessibility ($\alpha$) is computed 
from the irregularity (entropy) of the distribution of $p_{ij}^{(h)}$: 

$$\alpha_i^{(h)} = \exp\left(-\sum_j p_{ij}^{(h)} \ln p_{ij}^{(h)}\right). \qquad (2)$$

In general terms, the accessibility has proven useful to identify the borders of 
complex networks when self-avoiding random walks are performed [37]. In textual 
networks, this measurement has been employed to identify keywords and to generate 
informative extractive summaries [37]. 
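
To make the measurements listed above concrete, the sketch below computes three of them for a single node of an undirected network stored as an adjacency map. It is a minimal illustration, not the implementation used in the paper: the function names are hypothetical, the average shortest path length is taken over reachable nodes only, and the accessibility is computed with a plain (not self-avoiding) random walk, which is a simplifying assumption. Betweenness is left out for brevity.

```python
from collections import deque
from math import exp, log


def clustering(adj, v):
    """C = 2c / (k^2 - k), with c the number of edges among the k neighbors of v."""
    neighbors = list(adj[v])
    k = len(neighbors)
    if k < 2:
        return 0.0
    c = sum(
        1
        for a in range(k)
        for b in range(a + 1, k)
        if neighbors[b] in adj[neighbors[a]]
    )
    return 2.0 * c / (k * k - k)


def average_shortest_path(adj, v):
    """Mean breadth-first-search distance from v to every other reachable node."""
    dist = {v: 0}
    queue = deque([v])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    others = [d for node, d in dist.items() if node != v]
    return sum(others) / len(others) if others else 0.0


def accessibility(adj, v, h=2):
    """Exponential of the entropy of the h-step transition probabilities from v.

    A plain random walk is assumed here; self-avoiding walks would need path enumeration.
    """
    probs = {v: 1.0}
    for _ in range(h):
        nxt = {}
        for u, p in probs.items():
            for w in adj[u]:
                nxt[w] = nxt.get(w, 0.0) + p / len(adj[u])
        probs = nxt
    return exp(-sum(p * log(p) for p in probs.values() if p > 0))


# Toy network: a triangle (a, b, c) with a pendant node d attached to c.
toy = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
```

In the toy network, node `a` has two neighbors that are linked to each other, so its clustering coefficient is maximal, while a one-step walk from `a` reaches two nodes with equal probability and yields an accessibility of 2.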


2.3. Intermittency 


In linguistic models, the effects of attraction and repulsion of words are an ever-present 
phenomenon [31, 57]. Several studies have shown that the distribution of many words 
along documents is not regular [29–32]. Particularly, keywords are usually unevenly 
distributed along texts [16, 28–32]. This finding has motivated the design of keyword 
detection methods relying upon a single document [53]. To analyze the spatial 
distribution of words, each token is mapped to an element in a temporal series. The 
first word of the text represents the first element, the second word represents the second 
element and so forth. Given a word $w_i$ occurring $f_i$ times in the text, the recurrence times 
of $w_i$ generate the temporal series $T_i = \{t_1, t_2, \ldots, t_{f_i - 1}\}$, where $t_1$ is the distance 
(i.e. the number of intermediary words) between the first and second occurrence of $w_i$, $t_2$ 
is the distance between the second and third occurrence of $w_i$ and so on. Usually, 
two elements are added to the original temporal series $T_i$: the space $t_0$ until the first 
occurrence of $w_i$ and the space $t_{f_i}$ after the last occurrence of $w_i$. The distribution of $T_i$ 
might be characterized by the mean and standard deviation: 




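$$\langle T \rangle = \frac{1}{f_i + 1} \sum_{k=0}^{f_i} t_k = \frac{N + 1}{f_i + 1}, \qquad (3)$$

$$\Delta T = \sqrt{\frac{1}{f_i + 1} \sum_{k=0}^{f_i} \left(t_k - \langle T \rangle\right)^2}, \qquad (4)$$

where $N = \sum_i f_i$. Given $\langle T \rangle$ and $\Delta T$, the irregularity of the distribution $T_i$ is computed 
as 

$$I_i = \frac{\Delta T}{\langle T \rangle} = \frac{f_i + 1}{N + 1}\, \Delta T. \qquad (5)$$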


The measurement defined in eq. 5 is known as intermittency (or burstiness) of the 
distribution. It has been widely employed to detect keywords in texts as an alternative 
to the tf-idf technique [53]. In addition, the intermittency has proven relevant to detect 
keywords in genetic sequences [32]. 
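
As an illustration, the intermittency of a single word can be computed along the following lines. The sketch rests on two assumptions that should be checked against the original definition: gaps are measured as differences between token positions (so that the boundary-extended series sums to $N + 1$), and the standard deviation is normalized by the number of gaps. The function name `intermittency` is introduced here for the example.

```python
from math import sqrt


def intermittency(tokens, word):
    """Ratio between the standard deviation and the mean of the recurrence times of `word`.

    The gap before the first occurrence and the gap after the last one are included.
    """
    n = len(tokens)
    positions = [i + 1 for i, tok in enumerate(tokens) if tok == word]  # 1-based
    if not positions:
        raise ValueError(f"{word!r} does not occur in the text")
    # Gaps t_0, ..., t_f measured as position differences; they add up to n + 1.
    bounds = [0] + positions + [n + 1]
    gaps = [b - a for a, b in zip(bounds, bounds[1:])]
    mean = sum(gaps) / len(gaps)
    variance = sum((t - mean) ** 2 for t in gaps) / len(gaps)
    return sqrt(variance) / mean


# A word spread evenly along the text versus the same word concentrated in a burst.
regular = ["x", "w", "x", "w", "x", "w", "x"]
bursty = ["w", "w", "w", "x", "x", "x", "x"]
```

Both toy sequences contain the word `w` three times, yet only the bursty one yields a non-zero value, which mirrors the idea that intermittency captures how unevenly a word is distributed rather than how often it occurs.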

A qualitative comparison of words taking distinct values of intermittency is provided 
in Figure 2, which shows the distribution of the words “Carmylle” ($f_i = 54$) and “feel” 
($f_i = 54$) along the book “Adventures of Sally”, by Pelham Grenville Wodehouse. 
Because the distribution of “Carmylle” is much more irregular than the distribution of 
“feel”, the former takes a much higher value of intermittency, as defined in eq. 5. The 
burstiness revealed by “Carmylle” also suggests that this word represents a relevant 
concept in the book [31]. An important property of the intermittency measurement 
is that it does not correlate with the frequency (see Figure 3). This means that 
the relevance assigned by the intermittency is not influenced by the word frequency. 
Taking advantage of this property, recent studies have combined both frequency and 
intermittency measurements to improve several keyword detection methods [16, 30]. 




Figure 2. Profile of spatial distribution (horizontal axis: token position) along the 
book “Adventures of Sally”, by Pelham Grenville Wodehouse. The words considered were 
(a) “Carmylle”; and (b) “feel”. Note that the distribution of “Carmylle” is much more 
irregular than the distribution of “feel”. This suggests that “Carmylle” is much more 
relevant for the text than “feel”. 



Figure 3. Pearson correlation coefficient between intermittency and frequency in the 
books (a) “Roughing It”, by Mark Twain ($r = -0.14$); (b) “The Woman in White”, by 
Wilkie Collins ($r = 0.14$); and (c) “Moby Dick”, by Herman Melville ($r = -0.17$). 


2.4. Pattern recognition methods 

The classification task aims at associating categories (or classes) to elements taking 
into account the attributes (or features) of these elements [65]. More specifically, an 
attribute is a measurable property of objects. To illustrate the concept, suppose that, in 
a given application, one desires to classify people according to their physical attributes. 
In this case, the height, the skin and hair color, the weight and other factors could 
be selected as attributes. In many cases, the choice of discriminative and informative 
attributes plays an essential role in the performance of classification systems. Most of 
the attributes employed in traditional applications assume either numerical (e.g. 1, -7 
and 3.14) or categorical values (e.g. low and high). In the current study, one of the 
attributes employed to characterize texts is the intermittency of specific words. In this 
case, the intermittency of each word represents a numeric attribute. 

The classification task is of paramount relevance for information retrieval 
applications. Particularly, in this paper, pattern recognition methods are used to 
capture the patterns emerging from the representation of texts as networks. Moreover, 
pattern recognition methods are used to quantify the discriminative ability provided by 
these patterns. Currently, there are several automatic classification methods. They are 
traditionally divided into the following groups: 

• Supervised classification: a binary relation mapping the input to the output is 
generated. 

• Unsupervised classification: a partition of the dataset is generated so that 
similar elements are clustered together. 

• Semi-supervised classification: the dataset available for automatic learning 
comprises a small set of labeled instances. Most of the instances are not labeled, 
i.e. the class associated with these instances is lacking. In this case, the objective is 
to map the unlabeled input to a labeled output. 

Typically, supervised classification methods process two datasets. The training 
dataset is the set of examples used as input. In other words, it represents the 
set of examples whose classes are known beforehand. In this paper, the training 
set is represented as $S_{tr} = \{\beta_{(tr,1)}, \beta_{(tr,2)}, \beta_{(tr,3)}, \ldots\}$. The test dataset 
$S_{ts} = \{\beta_{(ts,1)}, \beta_{(ts,2)}, \beta_{(ts,3)}, \ldots\}$ is the set used to evaluate the performance of the classifier. 
A given example $\beta$ can be characterized by a set of $m$ features: 
$\beta = (F_1 = \beta^{(1)}, F_2 = \beta^{(2)}, \ldots, F_m = \beta^{(m)})$, where $F = \{F_1, F_2, \ldots, F_m\}$ is the set of 
attributes characterizing the example $\beta$. In other words, the value taken by the $k$-th 
attribute $F_k$ in $\beta$ is represented as $\beta^{(k)}$. In a supervised classification, a given example 
assumes a single class $c_i$ belonging to a finite set $C = \{c_1, c_2, \ldots\}$. 

To quantify the quality of the classification, the cross-validation technique was 
employed [69]. In this method, a fraction of the dataset is used to perform the training 
and another fraction is used to perform the evaluation. The implementation of this 
technique consists in splitting the training dataset into ten folds. Initially, nine folds 
are selected to train the classifiers and the remaining one is used for evaluation. This 
process is repeated ten times so that a different fold is used for the evaluation in each 
iteration. Finally, the accuracy rate is computed as the average accuracy obtained over 
the ten iterations. Cross-validation is considered a reliable index since the evaluation 
is performed over unknown instances. 
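
The ten-fold procedure described above can be sketched as follows. The code is a generic illustration rather than the evaluation pipeline used in the experiments: `k_fold_accuracy` and the toy `nearest_neighbor` classifier are hypothetical names, the shuffling seed is arbitrary, and any function with the same signature could replace the classifier.

```python
import random


def k_fold_accuracy(examples, labels, train_and_predict, k=10, seed=0):
    """Average accuracy over k folds; each fold is held out exactly once."""
    indices = list(range(len(examples)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    accuracies = []
    for held_out in folds:
        if not held_out:
            continue  # can only happen when there are fewer examples than folds
        held = set(held_out)
        train = [i for i in indices if i not in held]
        predictions = train_and_predict(
            [examples[i] for i in train],
            [labels[i] for i in train],
            [examples[i] for i in held_out],
        )
        hits = sum(p == labels[i] for p, i in zip(predictions, held_out))
        accuracies.append(hits / len(held_out))
    return sum(accuracies) / len(accuracies)


def nearest_neighbor(train_x, train_y, test_x):
    """Toy 1-nearest-neighbor classifier on numeric feature tuples (a stand-in only)."""

    def closest_label(x):
        best = min(
            range(len(train_x)),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(train_x[j], x)),
        )
        return train_y[best]

    return [closest_label(x) for x in test_x]


# Two well-separated synthetic classes, 20 examples each.
xs = [(float(i),) for i in range(20)] + [(100.0 + i,) for i in range(20)]
ys = ["A"] * 20 + ["B"] * 20
accuracy = k_fold_accuracy(xs, ys, nearest_neighbor, k=10)
```

Because every example is predicted by a model that never saw it during training, the averaged accuracy estimates performance on unknown instances, which is the property the text relies on.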

In the experiments, the analysis was performed on a dataset comprising books whose 
authorship is known beforehand. As a consequence, supervised classification methods 
were employed to recognize patterns in the generated textual time series. The methods 
employed in this study were: Bayesian Networks (BNT), Complement Naive Bayes 
(CNB), Naive Bayes (NVB), RBF Networks (RBF), Multi Layer Perceptron (MLP), 
Support Vector Machines (SVM), k Nearest Neighbors (KNN), C4.5 (C45) and Random 
Forest (RFO). A short introduction to these methods is provided in Appendix A. 

3. Results 

The stylistic properties of texts were studied in the context of the authorship recognition 
task. In this problem, one tries to recognize the identity of the authors of texts whose 
authorship is unknown. Owing to its central importance for stylometry, several contributions 
have been proposed [58]. Simple approaches include the analysis of word length and 
additional character features [67]. Mosteller and Wallace [68] showed that the frequency 
of function words (such as “and”, “any”, “ever”, “or”, “until” and “with”) can be 
employed to quantify the style of authors. More recently, many other approaches 
have been devised [5H], including those relying upon topological analysis of complex 
networks [SSHTOj. Here I use complex network and intermittency measurements to 
obtain potentially useful attributes for identifying the authors of books whose authorship 
is unknown. Because this study focuses on the analysis of stylistic fluctuations, the patterns 
displayed by the evolution of the statistical measurements along texts were studied. 

Two types of temporal series representing the stylistic evolution in books are 
studied. In Section 3.1, the relationship between the stylistic variation along books 
and the authorship recognition task is investigated. In Section 3.2, the intermittency 
profile of some words across different authors is employed to perform the authorship 
recognition task. The dataset employed in the experiments comprises books written by 
8 authors, as shown in Table S1 of the Supplementary Information. 

3.1. Stylistic variation along books 

In this section, I investigate whether the stylistic variation along texts provides useful 
attributes for the authorship recognition task. To quantify stylistic variations, the 
following methodology was adopted. Each book in the dataset was split into subtexts 
comprising $W$ tokens. Assuming that a book is formed by a sequence of tokens 
$\mathcal{W} = \{w_1, w_2, \ldots\}$, the $j$-th subtext $T_j$ will comprise the sequence 
$\{w_{s_j}, w_{s_j+1}, \ldots, w_{s_j+W}\}$, where $s_j = W \cdot j + 1$ and $j \in \{0, 1, 2, \ldots\}$. Each subtext 
$T_j$ was modeled as a complex network (see Section 2.1) and the topological measurements 
of each subtext were extracted (see Section 2.2). Thus, each topological measurement $X$ 
generates a temporal series $\mathcal{X} = \{x_1, x_2, \ldots, x_P\}$, where $x_j$ represents the value obtained 
for $X$ in the $j$-th subtext and $P$ is the total number of subtexts. An example of $\mathcal{X}$ for 
$X = \langle l \rangle$ and $X = \langle C \rangle$ is provided in Figure 4. The temporal series $\mathcal{X}$ of each book was then 
decomposed in terms of the Fourier transform: 

$$\mathcal{F}(\mathcal{X})_{(j)} = \sum_{k=1}^{P} x_k \exp\left(-\frac{2\pi \mathrm{i}}{P}\, j\, k\right), \qquad (6)$$

where $\mathrm{i}^2 = -1$. It is worth noting that the first component, given by 

$$\mathcal{F}(\mathcal{X})_{(0)} = \sum_{k=1}^{P} x_k = P \langle x \rangle,$$

only stores information concerning the average $\langle x \rangle$. Higher frequencies and, therefore, 
higher levels of variation in $\mathcal{X}$ are represented in the components $\mathcal{F}(\mathcal{X})_{(j)}$ with $j > 0$. 
As attributes of the classifiers, the first four components of $\mathcal{F}(\mathcal{X})$ were used. 
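
The pipeline of this section (split the book, measure each subtext, decompose the resulting series) can be sketched as below. Several choices here are assumptions made for the example: windows hold exactly $W$ tokens and an incomplete tail is discarded, components are indexed from zero, the sign convention of the exponent is the usual discrete Fourier one, and the vocabulary size stands in for the network measurements because it needs no graph code. All function names are hypothetical.

```python
import cmath


def split_into_subtexts(tokens, W):
    """Consecutive, non-overlapping windows of W tokens; a shorter tail is discarded."""
    return [tokens[s : s + W] for s in range(0, len(tokens) - W + 1, W)]


def vocabulary_size(subtext):
    """Number of distinct tokens, i.e. the number of nodes of the subtext network."""
    return len(set(subtext))


def fourier_components(series, n_components=4):
    """First coefficients sum_k x_k * exp(-2*pi*i*j*k/P), for j = 0 .. n_components - 1."""
    P = len(series)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * j * k / P) for k, x in enumerate(series, start=1))
        for j in range(n_components)
    ]


# Toy "book" of ten tokens split into windows of three tokens.
book = "a b c a b d a a a e".split()
subtexts = split_into_subtexts(book, 3)
series = [vocabulary_size(s) for s in subtexts]

# A perfectly constant series has no variation beyond its zero-frequency component.
flat = fourier_components([2.0, 2.0, 2.0, 2.0])
```

The constant series illustrates why the zero-frequency component only carries the average: every other coefficient vanishes when the measurement does not fluctuate along the text.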



Figure 4. Example of topological variation (horizontal axis: subtext) along the book 
“Great Expectations”, by Charles Dickens. The networks were formed using $W = 1{,}300$ 
tokens. The measurements considered were (a) the average shortest path length 
$\langle l \rangle$; and (b) the average clustering coefficient $\langle C \rangle$. 


The results obtained from the classification of authors are shown in Table 2. The 
subtext lengths considered were $W \in \{500, 700, 900, 1{,}100, 1{,}300\}$. For 
each subtext length, the table lists the accuracy rate obtained by the best classifier. 
The lowest accuracy rate occurred for $W = 500$ and the highest discriminability was 
achieved with $W = 1{,}300$. In all cases, the performance obtained by the classifiers 
was statistically significant, as revealed by low p-values. This result confirms that 
the stylistic variations of authors along texts (quantified via topological analysis of 
complex networks) can be employed to discriminate authors’ styles. In particular, the 
proposed method could be used as a complementary stylistic attribute, because the 
stylistic variation has been widely neglected as a relevant feature in current authorship 
attribution methods [58]. 

To verify the relative relevance of the features employed in the authorship 
recognition task, the information gain of each attribute in the training dataset was 
computed. Mathematically, the relevance ascribed by the information gain ($\Omega$) is 

$$\Omega(S_{tr}, F_k) = \mathcal{H}(S_{tr}) - \mathcal{H}(S_{tr}|F_k), \qquad (7)$$







Table 2. Accuracy rate obtained in the classification based on the Fourier 
decomposition of time series of complex network measurements. For each subtext 
length ($W$), the table lists the accuracy rate obtained by the best classifier. 

W       Method   Accuracy   p-value
------  -------  ---------  ------------
500     BNT      35.0%      2.2 X 10“^
700     CNB      37.5%      5.2 X 10-®
900     CNB      40.0%      1.1 × 10^-5
1,100   RBF      42.5%      2.2 × 10^-5
1,300   RFO      45.0%      4.0 X 10-^


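where $\mathcal{H}(S_{tr})$ is the entropy of the training dataset $S_{tr}$ and $\mathcal{H}(S_{tr}|F_k)$ is the entropy of 
$S_{tr}$ when $F_k$ is specified. $\mathcal{H}(S_{tr}|F_k)$ can be computed from the training dataset as 

$$\mathcal{H}(S_{tr}|F_k) = \sum_{v \in V(F_k)} \left|\{\beta \in S_{tr} \mid \beta^{(k)} = v\}\right| \cdot |S_{tr}|^{-1} \cdot \mathcal{H}\left(\{\beta \in S_{tr} \mid \beta^{(k)} = v\}\right), \qquad (8)$$

where $|\cdot|$ is the cardinality of the set and $V(F_k)$ represents the set of all values 
taken by the attribute $F_k$ in the training dataset, i.e. 

$$V(F_k) = \bigcup_{i=1}^{|S_{tr}|} \beta_{(tr,i)}^{(k)}. \qquad (9)$$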
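
The information gain of a single attribute can be computed with the short sketch below. It assumes that the attribute takes discrete values (numeric attributes such as Fourier components or intermittency values would first have to be discretized, a step not covered here) and that entropies are measured in bits; the function names are illustrative.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of the class distribution in `labels`."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(values, labels):
    """Entropy of the classes minus their entropy once the attribute value is known."""
    total = len(labels)
    conditional = 0.0
    for v in set(values):
        subset = [lab for val, lab in zip(values, labels) if val == v]
        conditional += len(subset) / total * entropy(subset)
    return entropy(labels) - conditional


# Four training examples from two authors and two candidate attributes.
authors = ["A", "A", "B", "B"]
informative = ["low", "low", "high", "high"]   # separates the authors perfectly
uninformative = ["low", "high", "low", "high"]  # says nothing about the author
```

An attribute that splits the training set into pure groups recovers the whole class entropy, while an attribute whose values are spread evenly over the classes yields no gain, which is how the ranking of attributes should be read.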

The rank of the most informative measurements, according to eq. 7, is shown in 
Table 3. In this table, the rows indicate the ranking obtained by the attributes, for 
each subtext length ($W$). All in all, the vocabulary size $M$ turned out to be one of the 
most relevant attributes. The measurement displaying the highest relevance in higher 
components of the Fourier transform ($j > 1$ in eq. 6) was the average shortest path 
length. More specifically, the third component ($j = 3$) displayed the highest relevance 
for large values of $W$, suggesting that the attribute related to the variation of $\langle l \rangle$ along 
the text becomes even more relevant when larger subtexts are analyzed. Interestingly, 
this result reinforces the importance of shortest paths for the authorship attribution 
task, since this measurement has been successfully employed to characterize authors’ 
styles in networks formed from full books [33]. The relevance of higher components 
of the Fourier transform can also be noted in the decision tree built with the C4.5 
method [59] (see Figure 5). Note that $\mathcal{F}(\langle l \rangle)_{(2)}$ and $\mathcal{F}(M)_{(2)}$ appear at the upper levels of 
the tree, thus confirming their relevance. The relative importance of higher components 
becomes even more apparent if one observes that some traditional complex network 
measurements (i.e. $\mathcal{F}(X)_{(0)}$) correlate with the vocabulary size $M$ [33]. This does not 
occur with $\mathcal{F}(\langle l \rangle)_{(2)}$, as revealed by the Pearson correlation coefficients displayed in Table 
4. In fact, none of the higher components found in Table 3 correlates significantly with 
other relevant traditional attributes ($\mathcal{F}(X)_{(0)}$), thus confirming that higher components 
indeed provide novel information for characterizing styles in written texts. 

The results concerning the evolution of styles revealed that distinct authors might 
display distinct stylistic patterns along texts. This finding is similar to the results found 

Table 3. Relative importance of the attributes used in the classification based on 
the spectral decomposition of complex network measurements. The rows represent the 
ranking obtained for a given attribute. For example, the best attribute for $W = 1{,}300$ 
was $\mathcal{F}(\langle C \rangle)_{(0)}$ and the second best attribute for $W = 1{,}300$ was $\mathcal{F}(M)_{(0)}$. 
Measurements with information gain below 0.500 are not shown. According 
to the information gain index, the third component of the average shortest path 
lengths turned out to be one of the most informative measurements. 


# 

W=500 

W=700 

W=900 

W=l,100 

W=l,300 

1st 

^(Af)(o) 

0.772 

^(M)(o) 

0.812 

0.778 

^((0)(2) 

0.850 

n{c))io) 

0.778 

2nd 

J?({aPI))(„, 

0.669 

^((C'))(o) 

0.778 

'^(Af)(o) 

0.772 

^(Af)(o) 

0.772 

=^(Af)(o) 

0.772 

3rd 

^((0)(0) 

0.665 

nmo) 

0.772 

0.712 

^((0)(0) 

0.691 

0.712 

4th 

^({a'=>))(0) 

0.653 

^((a'”))(0) 

0.669 

0.663 

^((a®))(0) 

0.601 

n{i)h) 

0.608 

5th 

n{i)h) 

0.558 

0.669 

0.653 

0.601 

■np'pP) 

0.606 

6th 


^((0)(2) 

0.558 

^((0)(2) 

0.548 


0.601 

7th 





0.558 

8th 





^((a®»(2) 

0.510 


in [60], which showed that the temporal evolution of stylistic features of books published 
between 1590 and 1922 is able to identify the traditional literary movements. The main 
feature differentiating this work from previous studies is that the stylistic variation 
inside books is much more subtle than the corresponding variation over different literary 
styles [61]. The emergence of the described patterns suggests the applicability of other 
temporal models. Alternative models could probe, for example, the patterns present in 
the spatial distribution of character bigrams BDl. In particular, this paper focuses primarily 
on the evolution of stylistic patterns measured by the spatial distribution of words. For 
this reason, the next section investigates whether the intermittency of specific words serves as 
authors’ fingerprints for the authorship recognition task. 

3.2. Authorship recognition via word intermittency 

To verify whether the uneven distribution of specific words along texts provides useful features 
for characterizing authors’ styles, the following experiment was carried out. Following 
the research on stylometry, this study focused on function words. The 100 most frequent 
Figure 5. Example of decision tree created to identify the authorship of books. To 
construct the tree, the C4.5 algorithm was employed in subtexts comprising $W = 1{,}300$ 
tokens. Note that the second component of the average shortest path length and 
vocabulary size ($\mathcal{F}(\langle l \rangle)_{(2)}$ and $\mathcal{F}(M)_{(2)}$) are relevant as they appear at the top of 
the tree. 


Table 4. Pearson correlation coefficient |r| between and the most 

informative measurements found for W = 1,300 (see Table 3). Because all correlations
assume low values, the information conveyed by the second component of a measurement X
differs from that conveyed by its simple average, i.e. the zeroth component.


X 

y 

\r{x,y)\ 

n{i)h) 


0.073 

mi)h) 


0.182 

mi)h) 


0.170 

mi)h) 


0.015 

mi)h) 


0.100 

^(M)(2) 


0.209 



0.041 



0.036 

^(M)(2) 


0.059 

^(M)(2) 


0.115 



0.010 

.^((a™»(2) 


0.042 

.^((a®»(2) 
words in the corpus were considered as function words. The intermittency of these
function words was then used as attributes for the classifiers. The best classifier, the
Multilayer Perceptron, yielded an accuracy rate of 65.0% (p-value = 1.3 x 10“^^). This
result suggests that, besides the frequency, the intermittency of specific function words
might be useful for characterizing authors’ styles in texts. Note that the discriminability
obtained with intermittency features is not influenced by the frequency of function
words, since there is no significant correlation between intermittency and frequency (see
Section [T3i) . 
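
The feature extraction described above can be sketched as follows. This is only an illustration: it assumes that the intermittency of a word is measured as the coefficient of variation of its recurrence intervals (standard deviation divided by mean), which is a common choice but may differ from the definition adopted earlier in the paper. The function names are hypothetical.

```python
from statistics import mean, pstdev


def recurrence_intervals(tokens, word):
    """Distances (in tokens) between consecutive occurrences of `word`."""
    positions = [i for i, tok in enumerate(tokens) if tok == word]
    return [b - a for a, b in zip(positions, positions[1:])]


def intermittency(tokens, word):
    """Assumed definition: std / mean of the recurrence intervals."""
    gaps = recurrence_intervals(tokens, word)
    if len(gaps) < 2:
        return None  # too few occurrences to measure fluctuations
    return pstdev(gaps) / mean(gaps)


def intermittency_features(tokens, function_words):
    """One attribute per function word, to be fed to a classifier."""
    return [intermittency(tokens, w) for w in function_words]
```

A word that recurs at perfectly regular intervals receives the value 0 under this definition, while a word appearing in bursts receives larger values.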

A detailed analysis of the classification revealed that most of the errors occurred
for Arthur Conan Doyle, Wilkie Collins and Mark Twain (result not shown). If these
authors are disregarded from the analysis, the intermittency features provide
an accuracy rate of 90% with the Multilayer Perceptron. Despite the large number
of attributes employed for discriminating authors’ styles, the discriminative ability
is concentrated in a few function words. According to the information gain measurement,
the words displaying the highest discriminative ability were (H = 0.620), “and”
(H = 0.604), (H = 0.530), “who” (H = 0.494) and “as” (H = 0.462). The high
discriminability obtained with the intermittency of these five words can be noted in the
principal component analysis shown in Figure 6.



[Figure 6 plot. Legend: Mark Twain, Wilkie Collins, Hector Munro, Charles Darwin, Bram Stoker. Axis label: first principal component.]


Figure 6. Principal component analysis performed for the authorship recognition
task. As attributes, only the five most informative features (the intermittency of
“but”, “and”, “who” and “as”) were employed to create the figure.


In summary, one can conclude that the representation of specific words as temporal
series might be useful for the authorship recognition task. As commented in Section
3.1, the use of the intermittency of specific words combined with traditional features might
be useful to improve the performance of style-based real applications. In this case, an
improved textual characterization would be provided, because the attributes generated
from textual fluctuations do not correlate with traditional features. Moreover, distinct
classifiers could be employed for each attribute type (e.g. frequency or intermittency), as
some classifiers perform better for specific attributes. As such, the classification becomes
more robust and accurate without the fine tuning required in single models [^. The
combination of attributes could be performed via ordinary voting of simple models m 
Another possibility is to consider fuzzy methods as independent classifiers and then
select the best weighting strategy for each classifier [26]. Furthermore, the successful
application of intermittency measurements in characterizing authors’ styles suggests 
that complementary studies should be carried out in order to probe whether additional 
features of temporal series modeling the spatial distribution of words are able to reveal 
novel stylistic/topological patterns. 
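
As an illustration of the ordinary voting mentioned above, the sketch below combines the labels predicted by independent classifiers (for instance, one trained on frequency attributes and another on intermittency attributes). The function name and the tie-breaking rule are assumptions made for this example and are not taken from the cited works.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most voted label; ties go to the earliest classifier."""
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label
```

Each entry of `predictions` is the author predicted by one classifier for the same text, so the combination requires no change to the individual models.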

4. Conclusion 

In this study, I investigated whether measurements characterizing temporal series from
texts are useful to identify authors’ styles. In the light of the results, one can
conclude that authors’ stylistic properties can be characterized by analyzing the
fluctuations of textual statistical measurements. The statistically significant accuracy
rates obtained in the authorship attribution task confirmed that the features derived
from the fluctuation of specific topological and intermittency measurements are able
to discriminate distinct authors. Using a co-occurrence network model, it was shown
that the relative importance of distinct attributes may depend on the subtext length.
Nevertheless, in general, further components of the Fourier decomposition of topological
measurements turned out to be relevant features for the task. An analysis of the spatial
distribution of specific words revealed distinct patterns of distribution for different
authors. Surprisingly, the intermittency of function words correctly discriminates the
authorship in 65% of the cases in a dataset comprising books written by 8 authors.

The focus of this investigation was on the evaluation of distinct attributes 
for characterizing authors’ styles, rather than maximizing the accuracy rate of the 
classification. However, the dependence with stylistic attributes found for the proposed
features suggests that attributes derived from the analysis of stylistic fluctuations can be
combined in a hybrid way with traditional attributes, such as the frequency of function
words [68]. As such, the findings reported in this paper may contribute
to the improvement of current authorship recognition methods |S7|. One could pursue 
this line of analysis further, identifying the combination of features yielding the best 
discriminability. Future investigations could probe the relevance of fluctuations in other
related complex systems, such as DNA and other generic symbolic sequences, since the 
techniques described here can be extended in a straightforward fashion to such cases. 

Appendix A. Pattern Recognition Methods

This appendix briefly describes the main pattern recognition methods employed in this
study. A complete reference to the field of pattern recognition can be found in [66].

Decision trees 

Decision tree algorithms employ trees [62] to summarize the patterns recognized in the
dataset (see Figure A1). Typically, a decision tree comprises internal and leaf nodes.
While internal nodes store the tests performed on specific attributes, leaf nodes represent
classes. The edges connect nodes according to the answers obtained from the tests. For
example, the node representing the test $F_i > \theta_i$ has two outgoing edges, namely “YES”
and “NO”. During the classification stage, one travels through the tree until a leaf node
is reached. The class associated with that leaf node is then assigned to the unknown
instance. The classification process is illustrated in Figure A1.
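
The classification walk can be sketched in a few lines. The `Node` class and its field names are hypothetical and serve only to illustrate the procedure; they do not correspond to any particular library.

```python
class Node:
    """Internal node (feature, threshold, yes, no) or leaf node (label only)."""

    def __init__(self, label=None, feature=None, threshold=None, yes=None, no=None):
        self.label = label          # class stored in a leaf node
        self.feature = feature      # index i of the tested attribute F_i
        self.threshold = threshold  # theta_i in the test F_i > theta_i
        self.yes = yes              # child followed when the test holds
        self.no = no                # child followed otherwise


def classify(root, instance):
    """Walk from the root until a leaf is reached and return its class."""
    node = root
    while node.label is None:
        node = node.yes if instance[node.feature] > node.threshold else node.no
    return node.label
```

For example, a tree whose root tests the first attribute against 0.5 can be built as `Node(feature=0, threshold=0.5, yes=Node(label="A"), no=Node(label="B"))`.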


Figure A1. Example of decision tree. To classify a new instance, one starts the walk
at the root node. The class assigned to the unknown instance is the class associated
with the leaf node found at the end of the walk.

To construct a decision tree, at each step, one tries to find an attribute $F_i$ and
a threshold $\theta$ so that the test $F_i > \theta$ yields the best dataset partition. One assumes
that the quality of a partition is proportional to the discriminability provided by that
partition. At each division, the goal is to separate one or more classes into distinct groups.
Several measurements have been proposed to quantify the quality of partitions. A well-known
measurement is the information gain, which is based on the Kullback-Leibler divergence [63].
The process of choosing the attribute with the highest information gain is repeated for the
two subsets created at each internal node. The recursion ends when a subset contains only
instances belonging to a single class. In this case, a leaf node is created to store the
corresponding class.
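
A minimal sketch of the threshold search for a single attribute is given below. It assumes the entropy-based information gain as the partition quality; C4.5 itself relies on a normalized variant (the gain ratio), so this code illustrates the principle rather than that algorithm. All function names are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of the class distribution in `labels`."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Entropy reduction obtained by splitting with the test value > threshold."""
    yes = [y for v, y in zip(values, labels) if v > threshold]
    no = [y for v, y in zip(values, labels) if v <= threshold]
    if not yes or not no:
        return 0.0  # the test does not partition the dataset
    n = len(labels)
    remainder = len(yes) / n * entropy(yes) + len(no) / n * entropy(no)
    return entropy(labels) - remainder


def best_threshold(values, labels):
    """Try midpoints between consecutive distinct values; None if all are equal."""
    cuts = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(cuts, cuts[1:])]
    if not candidates:
        return None
    return max(candidates, key=lambda t: information_gain(values, labels, t))
```

Repeating this search over all attributes and recursing on the two resulting subsets gives the construction procedure described above.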

The tree-based algorithms employed in this paper were C4.5 and Random
Forest. Further details regarding these methods can be found in [69].


Bayesian decision 


To classify a new instance, the Naive Bayes algorithm estimates the probability
distribution of each class $c_i \in C$. Given the likelihood profile of each class, the algorithm
employs the maximum a posteriori strategy to infer the correct class. The probability
of each $c_i \in C$ to be assigned to the instance $\vec{\beta}$ is

$$P(c_i|\vec{\beta}) = \frac{P(F_1 = \beta^{(1)}, \ldots, F_M = \beta^{(M)} | c_i) \, P(c_i)}{P(F_1 = \beta^{(1)}, \ldots, F_M = \beta^{(M)})}.$$

Note that $P(c_i)$ can be estimated as $\mathcal{N}(c_i)/|S_{tr}|$, where $\mathcal{N}(c_i)$ is the number of
objects in $S_{tr}$ belonging to class $c_i$. For classification purposes, the quantity $P(\vec{\beta})$ can
be disregarded from the analysis because it is constant for all $c_i \in C$. Finally, in
order to estimate $P(\vec{\beta}|c_i)$, the traditional Naive Bayes classifier assumes independence
between the features. Hence $P(\vec{\beta}|c_i)$ is estimated as

$$P(\vec{\beta}|c_i) = P(F_1 = \beta^{(1)}, \ldots, F_M = \beta^{(M)} | c_i) = \prod_{k=1}^{M} P(F_k = \beta^{(k)} | c_i).$$

Replacing this estimate of $P(\vec{\beta}|c_i)$ in the definition of $P(c_i|\vec{\beta})$ yields

$$P(c_i|\vec{\beta}) \propto P(c_i) \prod_{k=1}^{M} P(F_k = \beta^{(k)} | c_i).$$

Upon using the maximum a posteriori rule, the class $c_\beta$ can be estimated as

$$c_\beta = \arg\max_{c_i \in C} P(c_i) \prod_{k=1}^{M} P(F_k = \beta^{(k)} | c_i).$$

To obtain $c_\beta$ from the above equation, one must estimate the likelihood $P(F_k|c_i)$. Several
methods have been proposed to perform the estimation [61]. The Parzen-Rosenblatt
window algorithm has been widely employed as a non-parametric technique to estimate
probability densities [64].
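
The maximum a posteriori rule with Parzen-Rosenblatt likelihoods can be sketched as below. The Gaussian kernel, the bandwidth `h` and the function names are assumptions made for this illustration; the implementation used in the experiments may differ. Logarithms are summed instead of multiplying probabilities, which leaves the selected class unchanged and avoids numerical underflow.

```python
from math import exp, log, pi, sqrt


def parzen_density(x, samples, h):
    """Parzen-Rosenblatt estimate of a 1-D density at x (Gaussian kernel, width h)."""
    norm = len(samples) * h * sqrt(2 * pi)
    return sum(exp(-0.5 * ((x - s) / h) ** 2) for s in samples) / norm


def naive_bayes_predict(instance, training, h=1.0):
    """`training` maps each class to its list of feature vectors; returns the MAP class."""
    total = sum(len(rows) for rows in training.values())
    best_class, best_score = None, float("-inf")
    for cls, rows in training.items():
        score = log(len(rows) / total)              # log of the prior P(c_i)
        for k, value in enumerate(instance):
            column = [row[k] for row in rows]       # values of feature F_k in class c_i
            density = parzen_density(value, column, h)
            score += log(density + 1e-300)          # small constant guards against log(0)
        if score > best_score:
            best_class, best_score = cls, score
    return best_class
```

The prior is estimated as the fraction of training instances in each class, and each likelihood is estimated independently per feature, following the independence assumption stated above.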

In addition to the Naive Bayes, the algorithms based on statistical paradigms 
employed in this study were the Complement Naive Bayes and Bayesian Networks. 
More details concerning these methods can be found in [69] . 


Neural Networks 

The simplest artificial neural network (ANN) model is the Perceptron. In this model,
each neuron stores activation and transfer functions. While the former sums (with
weights) the input signals, the latter yields an output signal as a function of the input.
Figure A2 illustrates a single neuron with input signals and weights represented as $a_i$
and $w_i$, respectively. The output $s$ is $s = \sum_i w_i a_i + b$. The transfer function $\phi$ may
assume many distinct forms [65]. A very simple possibility is to consider that the neuron
is activated whenever $s$ surpasses a given threshold, i.e.

$$\phi(s) = \begin{cases} 1 & \text{for } s > 0, \\ 0 & \text{otherwise.} \end{cases} \qquad \text{(A.1)}$$

Figure A2. Example of a single neuron. We are given some input signals $a_i$
and expected outputs in a supervised classification. The learning algorithm aims at
minimizing the error between the actual and expected output signals.


The correct choice of synaptic weights in neural networks allows the network to
effectively process the input signals in order to generate the expected output. In
general, the weights are assigned by learning algorithms [65]. Initially, the weights $w_{ij}$
linking the $i$-th node of the input layer with the $j$-th node of the output layer
assume random values. Given these initial weights, several input signals are presented
to the neuron. Then, the obtained output is compared with the expected values. If
the observed error exceeds a given threshold, the current weights are modified by the
learning algorithm. In this case, the larger the error obtained, the greater the change
applied to the current weights. More specifically, weights are updated according to the
rule $w_{ij} \leftarrow w_{ij} + \eta \varepsilon_j x_i$, where $\eta$ is the learning rate, $x_i$ is the signal at the
$i$-th input node and $\varepsilon_j$ is the error obtained for the $j$-th neuron.
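
The update rule can be illustrated with a single output neuron and the threshold transfer function of equation (A.1). In this sketch the weights start at zero rather than at random values, only to keep the example reproducible, and the bias is updated in the same way as the weights; both choices, as well as the function names, are assumptions made for the illustration.

```python
def step(s):
    """Threshold transfer function: the neuron fires when s is positive."""
    return 1 if s > 0 else 0


def train_perceptron(samples, targets, eta=0.1, epochs=20):
    """Apply w_i <- w_i + eta * error * x_i after each presented input."""
    weights, bias = [0.0] * len(samples[0]), 0.0
    for _ in range(epochs):
        for x, target in zip(samples, targets):
            s = sum(w * xi for w, xi in zip(weights, x)) + bias
            error = target - step(s)  # expected minus obtained output
            weights = [w + eta * error * xi for w, xi in zip(weights, x)]
            bias += eta * error
    return weights, bias


def predict(weights, bias, x):
    """Output of the trained neuron for the input signals x."""
    return step(sum(w * xi for w, xi in zip(weights, x)) + bias)
```

A single neuron of this kind can only separate classes with a linear boundary, which is one reason why multilayer architectures are preferred in practice.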

The ANN-based pattern recognition methods employed in this study were the 
Multilayer Perceptron and the RBF network. More details concerning these methods 
can be found in [65]. 